Identification of Cognates and Recurrent Sound Correspondences in Word Lists

Author

  • Grzegorz Kondrak
Abstract

Identification of cognates and recurrent sound correspondences is a component of two principal tasks of historical linguistics: demonstrating the relatedness of languages, and reconstructing the histories of language families. We propose methods for detecting and quantifying three characteristics of cognates: recurrent sound correspondences, phonetic similarity, and semantic affinity. The ultimate goal is to identify cognates and correspondences directly from lists of words representing pairs of languages that are known to be related. The proposed solutions are language independent, and are evaluated against authentic linguistic data. The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac language families indicate that our methods are more accurate than comparable programs, and achieve high precision and recall on various test sets. The results also suggest that combining various types of evidence substantially increases cognate identification accuracy.

Related resources

Identifying Complex Sound Correspondences in Bilingual Wordlists

The determination of recurrent sound correspondences between languages is crucial for the identification of cognates, which are often employed in statistical machine translation for sentence and word alignment. In this paper, an algorithm designed for extracting non-compositional compounds from bitexts is shown to be capable of determining complex sound correspondences in bilingual wordlists. I...

How Many Is Enough?—Statistical Principles for Lexicostatistics

Lexicostatistics has been applied in linguistics to inform phylogenetic relations among languages. There are two important yet not well-studied parameters in this approach: the conventional size of vocabulary list to collect potentially true cognates and the minimum matching instances required to confirm a recurrent sound correspondence. Here, we derive two statistical principles from stochasti...

Determining Recurrent Sound Correspondences by Inducing Translation Models

I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists. The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora). My method induces models of sound correspondence that are similar to models developed for statistical machine translation. The experiments show that...

Creating a Comparative Dictionary of Totonac-Tepehua

We apply algorithms for the identification of cognates and recurrent sound correspondences proposed by Kondrak (2002) to the Totonac-Tepehua family of indigenous languages in Mexico. We show that by combining expert linguistic knowledge with computational analysis, it is possible to quickly identify a large number of cognate sets within the family. Our objective is to provide tools for rapid co...

Fast and unsupervised methods for multilingual cognate clustering

In this paper we explore the use of unsupervised methods for detecting cognates in multilingual word lists. We use online EM to train sound segment similarity weights for computing similarity between two words. We tested our online systems on sixteen geographically dispersed language groups of the world and show that the Online PMI system (Pointwise Mutual Information) outperforms a HMM ...

Journal title:
  • TAL

Volume 50

Publication date 2009